Used Car Prediction

Objective: Reengineering after initial data preprocessing & feature engineering

Read Data

Change dtype to int

Summary of Data Without Target Variable

The summary shows extreme values in both price and odometer. Car_age contains a value of -1, which should not exist.

Odometer

Anomaly Detection

Odometer should top out at around 200,000 miles, since standard cars today are expected to keep running up to about 200,000 miles. The summary statistics, however, show readings above 200,000, up to 3736928711.

We will look at the distribution of odometer.

As expected, it is highly skewed to the right, so the outliers need to be handled.
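A minimal sketch of how right skew can be checked numerically alongside a histogram. The values below are made up for illustration, and the series name `odometer` is an assumption about the notebook's column naming.

```python
import pandas as pd

# Toy odometer readings (made-up values) with one extreme entry.
odometer = pd.Series(
    [12_000, 45_000, 60_000, 88_000, 120_000, 3_000_000], name="odometer"
)

# Sample skewness: a positive value indicates a long right tail.
skew = odometer.skew()
print(f"skewness: {skew:.2f}")

# A mean far above the median is another sign of right skew.
mean_above_median = odometer.mean() > odometer.median()
print(f"mean > median: {mean_above_median}")
```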

IQR limits: the right-tail limit is the 75th percentile + 3 × IQR, and the left-tail limit is the 25th percentile - 3 × IQR.
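The limits above can be sketched in plain pandas. The helper name `iqr_limits` and the toy data are hypothetical. Only the fold of 3 and the extreme reading of 3736928711 come from the notes.

```python
import pandas as pd


def iqr_limits(s: pd.Series, fold: float = 3.0) -> tuple[float, float]:
    """Return (lower, upper) limits at `fold` IQRs beyond the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Toy data: five plausible readings plus the extreme value from the summary.
df = pd.DataFrame(
    {"odometer": [10_000, 40_000, 60_000, 80_000, 110_000, 3_736_928_711]}
)

lower, upper = iqr_limits(df["odometer"], fold=3.0)

# Trimming removes the rows outside the limits instead of capping them.
trimmed = df[df["odometer"].between(lower, upper)]
print(lower, upper, len(trimmed))
```

A fold of 3 is wider than the common 1.5 × IQR rule, so only the most extreme readings are removed.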

Revisit the distribution after outlier trimming

Skewness is reduced to 0.3, and the distribution looks approximately normal.

The distribution shows that some rows have an odometer value of 0. In a used car market, an odometer reading of 0 is suspicious.

Assuming a new car can have an odometer reading of 0, the case below makes sense.

This case is questionable but also possible: a condition of 4 means like new.

Drop rows that have an odometer of 0 and a condition less than 4. In that case the car is used and has been driven, so its odometer should be above 0.
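A minimal sketch of this rule, assuming columns named `odometer` and `condition` and an ordinal condition scale. The notes confirm only that 4 means like new, so treating 5 as new is an assumption.

```python
import pandas as pd

# Toy rows; the condition scale beyond "4 = like new" is an assumption.
df = pd.DataFrame(
    {
        "odometer": [0, 0, 0, 35_000],
        "condition": [5, 4, 2, 3],
    }
)

# Flag rows that claim zero mileage but are in worse than like-new condition.
suspicious = (df["odometer"] == 0) & (df["condition"] < 4)

# Keep everything that is not flagged.
df = df[~suspicious].reset_index(drop=True)
print(df)
```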

Look at summary stats again.

Odometer now has more reasonable stats.

Distribution of the Price

Price is another variable that shows extreme values. Since we are looking at the used car market, we need to identify, based on the business purpose, cars that are not suitable for our analysis.

Skewness is 0.3 for the price column.

We will drop cars priced below 1000 or above 200000. Cars below 1000 are most likely posting errors, especially for recent model years: the seller may have listed a low price to get buyers' attention. Cars above 200000 are dropped because a typical used car buyer would not consider a car at that price, which is enough to buy a new Porsche.
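The price rule can be sketched as a range filter. The column name `price` and the toy values are assumptions. The bounds are kept inclusive because the notes drop only cars strictly below 1000 or strictly above 200000.

```python
import pandas as pd

PRICE_MIN, PRICE_MAX = 1_000, 200_000  # thresholds from the notes

# Toy listings (made-up prices).
df = pd.DataFrame({"price": [1, 500, 1_000, 15_900, 200_000, 350_000]})

# Series.between is inclusive on both ends by default.
in_range = df["price"].between(PRICE_MIN, PRICE_MAX)
df = df[in_range].reset_index(drop=True)
print(df)
```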

Using the feature_engine outlier trimmer, trim outliers from the dataset.

Look at the distribution again. Skewness increased after trimming, and we deferred the decision on how to handle it until after running the models. After running the models, we decided not to transform the price to reduce skewness, because a log transform made the model results worse.
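For reference, a small sketch of what a log transform does to skewness. The prices are made up, and the use of `numpy.log1p` rather than a plain log is an assumption, since the notes do not say which variant was tried. The sketch says nothing about model quality.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices (made-up values).
price = pd.Series([1_500, 4_000, 7_500, 12_000, 18_000, 30_000, 150_000])

raw_skew = price.skew()
log_skew = np.log1p(price).skew()  # log(1 + x) compresses the right tail

print(f"raw skewness: {raw_skew:.2f}")
print(f"log skewness: {log_skew:.2f}")
```

Lower skewness in the target does not guarantee better predictions, which is consistent with the decision above to keep the untransformed price.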

MSRP

We found that some rows have MSRP > price. MSRP is the initial selling price for new cars. We decided to drop the abnormal rows that have MSRP > price.

Drop quarter

Initially we derived a quarter feature, but the quarter is 2 for every row in the dataset. A constant feature cannot reveal any cyclical pattern, so we drop it here.

Remove rows with a car age of -1
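A minimal sketch of how a constant column such as quarter can be detected before dropping it. The column names and values are illustrative assumptions.

```python
import pandas as pd

# Toy frame where every row falls in quarter 2.
df = pd.DataFrame(
    {
        "quarter": [2, 2, 2, 2],
        "price": [5_000, 12_000, 8_500, 20_000],
    }
)

# A column with a single distinct value carries no signal for a model.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
print(constant_cols)

df = df.drop(columns=constant_cols)
```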

Look at correlation between price and features
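A sketch of how the correlation of each feature with price can be computed, assuming all columns are numeric. The column names and values are made up for illustration.

```python
import pandas as pd

# Toy data: price falls as mileage and age rise.
df = pd.DataFrame(
    {
        "price": [25_000, 18_000, 12_000, 8_000, 4_000],
        "odometer": [10_000, 40_000, 75_000, 110_000, 160_000],
        "car_age": [1, 3, 6, 9, 14],
    }
)

# Pearson correlation of every feature with the target, target itself excluded.
corr_with_price = df.corr()["price"].drop("price").sort_values()
print(corr_with_price)
```

Running the same computation before and after re-engineering gives a like-for-like comparison of the two sets of correlations.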

Original

Final After Re-engineering

Please refer to Modelling.ipynb for our modeling and metrics.